{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3",
      "language": "python"
    },
    "language_info": {
      "name": "python",
      "version": "3.7.6",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "colab": {
      "provenance": []
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "POWZoSJR6XzK"
      },
      "source": [
        "# Query translation\n",
        "\n",
        "txtai supports two main types of queries: natural language statements and SQL statements. Natural language queries handles a search engine like query. SQL statements enable more complex filtering, sorting and column selection. Query translation bridges the gap between the two and enables filtering for natural language queries.\n",
        "\n",
        "For example, the query:\n",
        "\n",
        "```\n",
        "Tell me a feel good story since yesterday\n",
        "```\n",
        "\n",
        "becomes\n",
        "\n",
        "```sql\n",
        "select * from txtai where similar(\"Tell me a feel good story\") and\n",
        "entry >= date('now', '-1 day')\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qa_PPKVX6XzN"
      },
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
        "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
        "trusted": true,
        "_kg_hide-output": true,
        "id": "24q-1n5i6XzQ"
      },
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Create index\n",
        "Let's first recap how to create an index. We'll use the classic txtai example.\n"
      ],
      "metadata": {
        "id": "0p3WCDniUths"
      }
    },
    {
      "cell_type": "code",
      "metadata": {
        "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
        "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
        "trusted": true,
        "id": "2j_CFGDR6Xzp",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "e65238c5-d67f-4fe4-9cbf-c88b0ff3a8bb"
      },
      "source": [
        "from txtai.embeddings import Embeddings\n",
        "\n",
        "data = [\"US tops 5 million confirmed virus cases\",\n",
        "        \"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg\",\n",
        "        \"Beijing mobilises invasion craft along coast as Taiwan tensions escalate\",\n",
        "        \"The National Park Service warns against sacrificing slower friends in a bear attack\",\n",
        "        \"Maine man wins $1M from $25 lottery ticket\",\n",
        "        \"Make huge profits without work, earn up to $100,000 a day\"]\n",
        "\n",
        "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\", \"content\": True})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "# Run a search\n",
        "embeddings.search(\"feel good story\", 1)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '4',\n",
              "  'score': 0.08329011499881744,\n",
              "  'text': 'Maine man wins $1M from $25 lottery ticket'}]"
            ]
          },
          "metadata": {},
          "execution_count": 30
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Query translation models\n",
        "\n",
        "Next we'll explore how query translation models work with examples. "
      ],
      "metadata": {
        "id": "QTee7YMNDD4R"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.pipeline import Sequences\n",
        "\n",
        "sequences = Sequences(\"NeuML/t5-small-txtsql\")\n",
        "\n",
        "queries = [\n",
        "  \"feel good story\",\n",
        "  \"feel good story since yesterday\",\n",
        "  \"feel good story with lottery in text\",\n",
        "  \"how many feel good story\",\n",
        "  \"feel good story translated to fr\",\n",
        "  \"feel good story summarized\"\n",
        "]\n",
        "\n",
        "# Prefix to pass to T5 model\n",
        "prefix = \"translate English to SQL: \"\n",
        "\n",
        "for query in queries:\n",
        "  print(f\"Input: {query}\")\n",
        "  print(f\"SQL: {sequences(query, prefix)}\")\n",
        "  print()\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "04iOgP_ojSfK",
        "outputId": "41e73130-75e6-4ae9-b690-685972a13565"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Input: feel good story\n",
            "SQL: select id, text, score from txtai where similar('feel good story')\n",
            "\n",
            "Input: feel good story since yesterday\n",
            "SQL: select id, text, score from txtai where similar('feel good story') and entry >= date('now', '-1 day')\n",
            "\n",
            "Input: feel good story with lottery in text\n",
            "SQL: select id, text, score from txtai where similar('feel good story') and text like '% lottery%'\n",
            "\n",
            "Input: how many feel good story\n",
            "SQL: select count(*) from txtai where similar('feel good story')\n",
            "\n",
            "Input: feel good story translated to fr\n",
            "SQL: select id, translate(text, 'fr') text, score from txtai where similar('feel good story')\n",
            "\n",
            "Input: feel good story summarized\n",
            "SQL: select id, summary(text) text, score from txtai where similar('feel good story')\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Looking at the query translations above gives an idea on how this model works.\n",
        "\n",
        "[t5-small-txtsql](https://huggingface.co/NeuML/t5-small-txtsql) is the default model. Custom domain query syntax languages can be created using this same methodology, including for other languages. Natural language can be translated to functions, query clauses, column selection and more!"
      ],
      "metadata": {
        "id": "rAnEMaiWlOXm"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Natural language filtering\n",
        "\n",
        "Now it's time for this in action! Let's first initialize the embeddings index with the appropriate settings."
      ],
      "metadata": {
        "id": "P9hOcgNfjSyL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.pipeline import Translation\n",
        "\n",
        "def translate(text, lang):\n",
        "  return translation(text, lang)\n",
        "\n",
        "translation = Translation()\n",
        "\n",
        "# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/nli-mpnet-base-v2\",\n",
        "                         \"content\": True,\n",
        "                         \"query\": {\"path\": \"NeuML/t5-small-txtsql\"},\n",
        "                         \"functions\": [translate]})\n",
        "\n",
        "# Create an index for the list of text\n",
        "embeddings.index([(uid, text, None) for uid, text in enumerate(data)])\n",
        "\n",
        "query = \"select id, score, translate(text, 'de') 'text' from txtai where similar('feel good story')\"\n",
        "\n",
        "# Run a search using a custom SQL function\n",
        "embeddings.search(query)[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "L2DDJrd0RAaN",
        "outputId": "c60bfc29-187b-4926-8656-04063b2ca85c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'id': '4',\n",
              " 'score': 0.08329011499881744,\n",
              " 'text': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket'}"
            ]
          },
          "metadata": {},
          "execution_count": 32
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Note how the query model was provided as a embeddings index configuration parameter. Custom SQL functions were also added in. Let's now see if the same SQL statement can be run with a natural language query."
      ],
      "metadata": {
        "id": "MNd7QmFmnh-f"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"feel good story translated to de\")[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "mnGSBNMonur2",
        "outputId": "dc14d898-b46f-42e6-88e0-434863cc15d6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'id': '4',\n",
              " 'score': 0.08329011499881744,\n",
              " 'text': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket'}"
            ]
          },
          "metadata": {},
          "execution_count": 33
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Same result. Let's try a few more."
      ],
      "metadata": {
        "id": "OnmVUK2DoZkh"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"feel good story since yesterday\")[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Js2b_M61oitg",
        "outputId": "fab080cb-d082-4cf3-8314-12d23acc3c02"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'id': '4',\n",
              " 'score': 0.08329011499881744,\n",
              " 'text': 'Maine man wins $1M from $25 lottery ticket'}"
            ]
          },
          "metadata": {},
          "execution_count": 34
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"feel good story with lottery in text\")[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "pcxPgriwomWv",
        "outputId": "7e71dd09-7c4f-4147-ba9e-47a45c1cf90f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'id': '4',\n",
              " 'score': 0.08329011499881744,\n",
              " 'text': 'Maine man wins $1M from $25 lottery ticket'}"
            ]
          },
          "metadata": {},
          "execution_count": 35
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "For good measure, a couple queries with filters that return no results."
      ],
      "metadata": {
        "id": "1ob4Q_CLpGBm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"feel good story with missing in text\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LMaf7JJzonAC",
        "outputId": "71080bd1-2d9e-41ba-9501-2cdedaf26d6d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[]"
            ]
          },
          "metadata": {},
          "execution_count": 36
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"feel good story with field equal 14\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ADFNLhvto_0J",
        "outputId": "12e08ea8-7203-4fb0-cb84-1133b5db9dc0"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[]"
            ]
          },
          "metadata": {},
          "execution_count": 37
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Query translation with applications\n",
        "\n",
        "Of course this is all available with YAML-configured applications."
      ],
      "metadata": {
        "id": "mTT8nopiRdVH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "config = \"\"\"\n",
        "translation:\n",
        "\n",
        "writable: true\n",
        "embeddings:\n",
        "  path: sentence-transformers/nli-mpnet-base-v2\n",
        "  content: true\n",
        "  query:\n",
        "    path: NeuML/t5-small-txtsql\n",
        "  functions:\n",
        "    - {name: translate, argcount: 2, function: translation}\n",
        "\"\"\"\n",
        "\n",
        "from txtai.app import Application\n",
        "\n",
        "# Build application and index data\n",
        "app = Application(config)\n",
        "app.add([{\"id\": x, \"text\": row} for x, row in enumerate(data)])\n",
        "app.index()\n",
        "\n",
        "# Run search query\n",
        "app.search(\"feel good story translated to de\")[0]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "FZ_7G6M4RUbz",
        "outputId": "b00ce1c2-8df2-4289-8d03-991174a0e74f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'id': '4',\n",
              " 'score': 0.08329011499881744,\n",
              " 'text': 'Maine Mann gewinnt $1M von $25 Lotterie-Ticket'}"
            ]
          },
          "metadata": {},
          "execution_count": 38
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aDIF3tYt6X0O"
      },
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook introduced natural language filtering with query translation models. This powerful feature adds filtering and pipelines to natural language statements. Custom domain-specific query languages can be created to enable rich queries natively in txtai."
      ]
    }
  ]
}